Introduction to data mining:
Data mining is an automated process of searching data for relationships and patterns. It is often done under the supervision of a human agent: the computer uses data mining algorithms to identify possible patterns in the data, and the human agent visually checks whether the patterns are relevant.
Applications of data mining include credit risk assessment in financial institutions, fraud detection at credit card companies and sales analysis in the retail industry; in general, any company that collects large amounts of historical data can benefit. Data mining is also very useful whenever the size of the data collected in a database makes manual analysis tedious.
Data mining can be used to improve revenue and reduce costs in a company. Consider a company that uses postal mail to market products to customers. This company can use data mining to select the customer attributes that best predict whether a customer will buy a product from a mailing campaign. Traditionally, the company may have targeted its mailings using customer income alone. The data mining process, however, may discover that marital status, age and income together are much better predictors of whether a customer will buy from a mailing campaign than income alone. Based on this information, the company could target its mailings to a more select group of customers, thereby reducing costs and increasing profits. The data mining process may also discover a cluster of highly affluent single parents in a particular location who respond to the mailings at an unusually high rate. This could alert the marketing manager to the existence of a highly profitable market segment for the company.
Note: data mining does not replace the need for a good understanding of the data and the business, or for sound judgment in distinguishing spurious from significant discoveries in your data.
The increasing availability of cheaper and more powerful computer processing means that the ability to analyze even terabytes of data is affordable to many companies. Although the data mining process for a very large database (VLDB) could take weeks on a slow computer, applying networked, multiprocessor-capable computers with gigabytes of memory to such databases provides satisfactory performance in many cases.
Although there are data mining tools that you can use without knowing the details of how they work, that knowledge is valuable at the very least when selecting between the various data mining tools that are available.
There are many data mining algorithms and techniques used by the machine learning community; one of them, known as One Rule (1Rule), is implemented below in Visual Basic.
Note: this article presents the pseudocode for a data mining algorithm and the Visual Basic implementation of that algorithm. The Visual Basic implementation may be used with any Relational Database Management System (RDBMS), including Microsoft® Access®, Microsoft® SQL Server, Oracle® and Sybase® databases.
Author’s Bio:
The author is an engineering graduate specializing in database and data mining programming and consulting. He can be reached by email at public@msaccessguru.com or at his website http://www.msaccessguru.com/.
Full Source Code:
THE FULL SOURCE CODE AND FILES FOR THIS ARTICLE ARE AVAILABLE FREE AT MSACCESSGURU.COM.
Legal Notice:
This article and the accompanying algorithm and source code are distributed without any warranty as to the fitness of the algorithm, article or source code. The use or distribution of this article, algorithm and source code is free, provided that they are used or distributed along with this article and without any changes to the article. No claim is made as to the accuracy or fitness of this algorithm or source code. The use of this algorithm, article or source code is at your own risk, and the author is not liable in any way for its use or for damages that may occur as a result of its use. Msaccessguru.com retains full copyright over the article and the accompanying source code.
1Rule (1R) algorithm for data mining:
1Rule creates one data mining rule for your data based on a single attribute (column in a database table). It chooses the rule that gives the lowest classification error after comparing the error rates of all the attributes. The rule assigns each distinct value of the one chosen attribute to one category or class. The algorithm can be defined in pseudocode as:
For each attribute in the data set
    For each distinct value of the attribute
        Find the most frequent classification
        Assign the classification to the value
        Calculate the error rate for the value
    Calculate the total error rate for the attribute
Choose the attribute with the lowest error rate
Create one rule for the chosen attribute
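The article's implementation is in Visual Basic, but the pseudocode above is compact enough to sketch directly. The following Python version is an illustration only (the function name `one_rule` and the dictionary-per-row data model are assumptions of this sketch, not the article's source):

```python
from collections import Counter, defaultdict

def one_rule(rows, attributes, class_attr):
    """1Rule: pick the single attribute whose one-column rule misclassifies
    the fewest instances.

    rows       - list of dicts mapping attribute name -> value
    attributes - candidate predictor attribute names
    class_attr - name of the class attribute

    Returns (best_attribute, rule, error_count), where rule maps each
    distinct value of best_attribute to its most frequent class.
    """
    best = None
    for attr in attributes:
        # For each distinct value of this attribute, count the classes.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[class_attr]] += 1
        # Assign each value its most frequent class; the rest are errors.
        rule, errors = {}, 0
        for value, class_counts in counts.items():
            majority_class, majority_count = class_counts.most_common(1)[0]
            rule[value] = majority_class
            errors += sum(class_counts.values()) - majority_count
        # Keep the attribute with the lowest total error count.
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best
```

Run against the lenses table of Appendix A, this selects Tear_Production_Rate with 7 errors out of 24 instances, which matches the worked example that follows.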
A Visual Basic Data Mining Implementation (The Lenses Database):
The data used is from a lenses database [2] for fitting contact lenses. This database was obtained from the UCI Machine Learning Repository [1]. The Lenses database has 5 attributes (columns in a database table), 24 instances (rows in the table) of data and 3 classes (hard contact lenses, soft contact lenses or no contact lenses). The goal of One Rule (1Rule) data mining in this implementation is to classify each distinct value of the attributes Patient_Age, Spectacle_Prescription, Astigmatic and Tear_Production_Rate of the Lenses database as No_contact_lenses, Soft_contact_lenses or Hard_contact_lenses, and to choose the single attribute that does so with the lowest error.
The source files include 3 files named Index, lenses.data and lenses.names. lenses.data is a data file containing all possible numeric combinations of the attribute (column) values. The lenses data was imported from lenses.data into the dbo_lenses table in Microsoft® Access®. The column names and data values were obtained from the file lenses.names. The numeric data in the Microsoft® Access® table dbo_lenses was then converted to a lenses table containing descriptive labels for the data, using values from lenses.names. The dbo_lenses table or lenses table is used as the final data set needed by the Visual Basic implementation of the 1Rule data mining algorithm.
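The numeric-to-label conversion is a simple lookup. A Python sketch of the idea (the code-to-label maps below are read off Figures 3 and 4; the name `decode_row` is illustrative, not from the article's source):

```python
# Code-to-label maps derived from lenses.names; the pairings match the
# numeric dbo_lenses table (Figure 4) against the labeled lenses table (Figure 3).
LABELS = {
    "PATIENT_AGE": {1: "Young", 2: "Pre-presbyopic", 3: "Presbyopic"},
    "SPECTACLE_PRESCRIPTION": {1: "Myope", 2: "Hypermetrophe"},
    "ASTIGMATIC": {1: "No", 2: "Yes"},
    "TEAR_PRODUCTION_RATE": {1: "Reduced", 2: "Normal"},
    "CONTACT_LENSES": {1: "Hard_contact_lenses", 2: "Soft_contact_lenses",
                       3: "No_contact_lenses"},
}

def decode_row(numeric_row):
    """Convert one dbo_lenses row {column: code} to descriptive labels.

    Columns without a map (ROW_NUMBER, the primary key) pass through unchanged.
    """
    return {col: LABELS[col][val] if col in LABELS else val
            for col, val in numeric_row.items()}
```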
Data Mining Diagram For The Lenses Data:
Figure 1
The first step in implementing 1Rule is to note which attributes from the lenses table (see Appendix A: Figure 3) will be used to create the best one rule for the data set. Patient_Age, Spectacle_Prescription, Astigmatic and Tear_Production_Rate were chosen. Row_Number was not chosen because it is the Primary Key or Row Identifier for the lenses table. Contact_Lenses was not chosen because it contains the categories (No_contact_lenses, Soft_contact_lenses or Hard_contact_lenses) used to classify the data. For example, a value of the Patient_Age attribute (column in the lenses table) may be classified as No_contact_lenses.
The second step is to list the distinct values of each attribute. For the lenses table (see Appendix A: Figure 3):
The Patient_Age attribute (column in the lenses table) has 3 distinct values: Young, Pre-presbyopic and Presbyopic.
The Spectacle_Prescription attribute has 2 distinct values: Myope and Hypermetrophe.
The Astigmatic attribute has 2 distinct values: No and Yes.
The Tear_Production_Rate attribute has 2 distinct values: Reduced and Normal.
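This step is mechanical; with rows modeled as dictionaries (an assumption of these sketches, not the article's VB recordsets), it is one line of Python:

```python
def distinct_values(rows, attribute):
    """Return the distinct values of one attribute, in order of first appearance."""
    # dict.fromkeys de-duplicates while preserving the order values were seen in.
    return list(dict.fromkeys(row[attribute] for row in rows))
```

Applied to the lenses table, `distinct_values(rows, "Patient_Age")` would yield Young, Pre-presbyopic and Presbyopic.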
The third step is to find the most frequent classification for each distinct value of an attribute, using the Contact_Lenses attribute values (No_contact_lenses, Soft_contact_lenses or Hard_contact_lenses).
Patient_Age attribute (column in the lenses table):
Young has 8 instances (rows in the lenses table) classified as:
No_contact_lenses = 4 instances (rows)
Soft_contact_lenses = 2 instances (rows)
Hard_contact_lenses = 2 instances (rows)
The most frequent classification is No_contact_lenses (4 instances).
We then make a rule classifying the Young attribute value as No_contact_lenses (Young → No_Contact_Lenses).
The error count for the Young value is the number of times it appears in the data set (8 instances) minus the number of instances of its most frequent class (4 instances).
The Error Rate for the Young value is 4/8.
The next step is to repeat the calculations above for the Pre-presbyopic value.
Pre-presbyopic has 8 instances (rows in the lenses table) classified as:
No_contact_lenses = 5 instances (rows)
Soft_contact_lenses = 2 instances (rows)
Hard_contact_lenses = 1 instance (row)
The most frequent classification is No_contact_lenses (5 instances).
We then make a rule classifying the Pre-presbyopic attribute value as No_contact_lenses (Pre-presbyopic → No_Contact_Lenses).
The error count for the Pre-presbyopic value is the number of times it appears in the data set (8 instances) minus the number of instances of its most frequent class (5 instances).
The Error Rate for the Pre-presbyopic value is 3/8.
The final step is to repeat the calculations above for the Presbyopic value.
Presbyopic has 8 instances (rows in the lenses table) classified as:
No_contact_lenses = 6 instances (rows)
Soft_contact_lenses = 1 instance (row)
Hard_contact_lenses = 1 instance (row)
The most frequent classification is No_contact_lenses (6 instances).
We then make a rule classifying the Presbyopic attribute value as No_contact_lenses (Presbyopic → No_Contact_Lenses).
The error count for the Presbyopic value is the number of times it appears in the data set (8 instances) minus the number of instances of its most frequent class (6 instances).
The Error Rate for the Presbyopic value is 2/8.
The total number of errors for the Patient_Age attribute is 4 + 3 + 2 = 9, out of 24 instances in the data set. The Total Error Rate for the Patient_Age attribute is therefore 9/24.
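The tallies above can be checked mechanically. A small Python sketch, using the per-value class counts just worked out (the names here are illustrative; `Fraction` keeps the rates exact rather than as floats):

```python
from fractions import Fraction

# Class counts per Patient_Age value, as tallied in the text above.
patient_age_counts = {
    "Young":          {"No": 4, "Soft": 2, "Hard": 2},
    "Pre-presbyopic": {"No": 5, "Soft": 2, "Hard": 1},
    "Presbyopic":     {"No": 6, "Soft": 1, "Hard": 1},
}

def total_error_rate(value_counts):
    """Sum (instances minus majority count) per value, over total instances."""
    errors = sum(sum(c.values()) - max(c.values()) for c in value_counts.values())
    total = sum(sum(c.values()) for c in value_counts.values())
    return Fraction(errors, total)
```

`total_error_rate(patient_age_counts)` gives 9/24, which `Fraction` normalizes to 3/8.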
The fourth step is to repeat the above three steps for the remaining attributes Spectacle_Prescription, Astigmatic and Tear_Production_Rate to obtain their classifications and Total Error Rates.
The Total Error Rates for each attribute:
Patient_Age: 9/24
Spectacle_Prescription: 9/24
Astigmatic: 9/24
Tear_Production_Rate: 7/24
The fifth step is to choose the attribute with the lowest error rate. Tear_Production_Rate has the lowest error rate, 7/24.
The final step is to create One Rule (1Rule) based on this attribute. Classifying the attribute Tear_Production_Rate using the methodology of the previous exercise for the Patient_Age attribute (first step to third step) gives the classification rule: Reduced → No_contact_lenses and Normal → Soft_contact_lenses.
The One Rule (1Rule) chosen for the lenses table is:
Tear_Production_Rate:
Reduced → No_contact_lenses
Normal → Soft_contact_lenses
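Once chosen, the rule itself is just a lookup on one attribute. A minimal Python sketch of applying it (the names `ONE_RULE` and `classify` are this sketch's, not the article's):

```python
# The chosen 1Rule: one attribute, one class per distinct value.
ONE_RULE_ATTRIBUTE = "Tear_Production_Rate"
ONE_RULE = {"Reduced": "No_contact_lenses", "Normal": "Soft_contact_lenses"}

def classify(instance):
    """Classify an instance using only the single chosen attribute."""
    return ONE_RULE[instance[ONE_RULE_ATTRIBUTE]]
```

All other attributes of an instance are simply ignored; that is the whole point, and the whole limitation, of a one-attribute rule.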
This process could become very tedious for even a few thousand rows of data; it has been automated in Visual Basic as described in the rest of the article.
Visual Basic procedure for data mining based on One Rule (1R):
The Visual Basic project is implemented as 2 classes and 2 modules. The classes are cDataAccess and c1Rule, and the modules are modMain and modDataMining.
Descriptions of the classes and modules:
modDataMining:
This module implements the procedure Create_1Rule(), which creates One Rule (1R) for the data. Create_1Rule() initializes the SQL string used to access the Relational Database Management System (RDBMS) and the ADO connection string used to open a valid ADO connection to the RDBMS. It then executes the SQL string against the RDBMS to create an ADO recordset, using the ExecuteSQL() function of the cDataAccess class. Create_1Rule() then calls the ChooseBestRule() procedure of the c1Rule class, which creates a rule describing the attribute (column in a database table) that has the lowest error rate. The results of the call are then saved to XML with a call to the WriteToXML() function of the modMain module.
Visual Basic Data Mining with 1Rule:
Figure 2
cDataAccess:
This is the data access class. It uses the Microsoft® ActiveX Data Objects 2.1 library (ADO 2.1 or higher) from msadotlb21.tlb. The main purpose of the class is to create an ADO connection to any Relational Database Management System (RDBMS), such as Microsoft® Access® or Microsoft® SQL Server, for which there is an ADO provider. A call to the ExecuteSQL() function returns an ADO recordset. The function accepts the following parameters:
QueryString - a SQL SELECT string
Connect - an ADO connection string
Conn - an ADO connection
CursorLocation - the location of the cursor (client or server)
CursorType - the type of cursor to be used
LockType - the type of lock placed on records during editing
ReturnRecordset - the ADO recordset returned from the RDBMS
The ExecuteSQL function accepts, among other parameters, a SQL SELECT string to run on the database, for example “SELECT PATIENT_AGE, SPECTACLE_PRESCRIPTION, ASTIGMATIC, TEAR_PRODUCTION_RATE, CONTACT_LENSES FROM DBO_LENSES”, and returns an ADO recordset to the calling function.
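ADO and Visual Basic aside, the shape of ExecuteSQL() is easy to illustrate. The sketch below uses Python's built-in sqlite3 module as a stand-in for an ADO provider; the function name and the two parameters shown mirror the article's QueryString and Connect, but the cursor and lock parameters have no sqlite3 equivalent and are omitted:

```python
import sqlite3

def execute_sql(query_string, connect):
    """Rough analogue of the article's ExecuteSQL(): open a connection,
    run a SELECT, and return the rows as a list of dictionaries
    (standing in for an ADO recordset)."""
    conn = sqlite3.connect(connect)
    conn.row_factory = sqlite3.Row  # lets rows be addressed by column name
    try:
        return [dict(row) for row in conn.execute(query_string)]
    finally:
        conn.close()
```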
modMain:
This module contains the function WriteToXML(), which saves an ADO recordset to XML. The function WriteToXML(rst As ADODB.Recordset, strPath As String) As Boolean requires two parameters:
rst - the ADO recordset to save
strPath - a valid file path to save the ADO recordset to
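ADO recordsets carry their own XML persistence format; the sketch below only illustrates the idea of WriteToXML() in Python with the standard library's xml.etree module, using a list of row dictionaries in place of a recordset and an element layout of this sketch's own choosing:

```python
import xml.etree.ElementTree as ET

def write_to_xml(rows, str_path):
    """Rough analogue of the article's WriteToXML(): save a list of row
    dictionaries to an XML file, one <record> element per row."""
    root = ET.Element("recordset")
    for row in rows:
        record = ET.SubElement(root, "record")
        for column, value in row.items():
            ET.SubElement(record, column).text = str(value)
    ET.ElementTree(root).write(str_path, encoding="utf-8", xml_declaration=True)
    return True  # mirrors the VB function's Boolean return
```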
c1Rule:
This class implements the One Rule (1Rule) data mining algorithm and procedures. The procedures and functions it implements include:
AddFilter() - Adds a value to a data structure (recordset) if the value does not already exist in it. The values added will be used as a filter.
AssignClass() - Calls CreateValueClass().
ChooseBestRule() - Creates a rule describing the attribute (column in a database table) with the lowest error rate.
CreateAddFilter() - Creates a data structure (attribute and value data) that stores unique values used in filtering other data structures.
CreateClassRules() - Creates a data structure which stores an attribute and the class it has been assigned to.
CreateDataSource() - Creates data structures for storing an attribute, its value, the class it is assigned to, its frequency and its total.
CreateValueClass() - Checks whether a value and class from the data set already exist in a data structure; if they do, it increments their frequency by 1, and if not, it adds the value and class with a frequency of 1.
FindErrorRate() - Calculates the error rate for each attribute.
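The frequency-counting logic that CreateValueClass() is described as performing is the heart of the algorithm's bookkeeping. A Python sketch of that described behavior (the dictionary keyed by (value, class) pairs is this sketch's data structure, not the VB class's recordset):

```python
def create_value_class(structure, value, cls):
    """Mirror of the CreateValueClass() behavior described above: bump the
    frequency of an existing (value, class) pair, or add it with frequency 1."""
    structure[(value, cls)] = structure.get((value, cls), 0) + 1
    return structure
```

Feeding every row of the data set through this counter yields exactly the per-value class tallies used in the worked Patient_Age example.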
Glossary of Data Mining Terms:
Attribute: A feature of your data set used to measure and describe it. An example is a column of a database table.
Classification: A machine learning summary or description of a data set. The class of an instance of data is the category it belongs to.
Data Mining: The process of discovering hidden and useful patterns and relationships in your data.
Instance: A specific, individual record in a data set. An example is a row of data in a database table.
Acknowledgements:
APPENDIX A:
The Lenses data used for data mining. Each instance is classified as No_contact_lenses, Soft_contact_lenses or Hard_contact_lenses. The attributes used for classification are Patient_Age, Spectacle_Prescription, Astigmatic and Tear_Production_Rate.
Figure 3 - lenses table:
ROW_NUMBER | PATIENT_AGE | SPECTACLE_PRESCRIPTION | ASTIGMATIC | TEAR_PRODUCTION_RATE | CONTACT_LENSES
1 | Young | Myope | No | Reduced | No_contact_lenses
2 | Young | Myope | No | Normal | Soft_contact_lenses
3 | Young | Myope | Yes | Reduced | No_contact_lenses
4 | Young | Myope | Yes | Normal | Hard_contact_lenses
5 | Young | Hypermetrophe | No | Reduced | No_contact_lenses
6 | Young | Hypermetrophe | No | Normal | Soft_contact_lenses
7 | Young | Hypermetrophe | Yes | Reduced | No_contact_lenses
8 | Young | Hypermetrophe | Yes | Normal | Hard_contact_lenses
9 | Pre-presbyopic | Myope | No | Reduced | No_contact_lenses
10 | Pre-presbyopic | Myope | No | Normal | Soft_contact_lenses
11 | Pre-presbyopic | Myope | Yes | Reduced | No_contact_lenses
12 | Pre-presbyopic | Myope | Yes | Normal | Hard_contact_lenses
13 | Pre-presbyopic | Hypermetrophe | No | Reduced | No_contact_lenses
14 | Pre-presbyopic | Hypermetrophe | No | Normal | Soft_contact_lenses
15 | Pre-presbyopic | Hypermetrophe | Yes | Reduced | No_contact_lenses
16 | Pre-presbyopic | Hypermetrophe | Yes | Normal | No_contact_lenses
17 | Presbyopic | Myope | No | Reduced | No_contact_lenses
18 | Presbyopic | Myope | No | Normal | No_contact_lenses
19 | Presbyopic | Myope | Yes | Reduced | No_contact_lenses
20 | Presbyopic | Myope | Yes | Normal | Hard_contact_lenses
21 | Presbyopic | Hypermetrophe | No | Reduced | No_contact_lenses
22 | Presbyopic | Hypermetrophe | No | Normal | Soft_contact_lenses
23 | Presbyopic | Hypermetrophe | Yes | Reduced | No_contact_lenses
24 | Presbyopic | Hypermetrophe | Yes | Normal | No_contact_lenses
APPENDIX A contd:
Figure 4 - dbo_lenses table:
ROW_NUMBER | PATIENT_AGE | SPECTACLE_PRESCRIPTION | ASTIGMATIC | TEAR_PRODUCTION_RATE | CONTACT_LENSES
1 | 1 | 1 | 1 | 1 | 3
2 | 1 | 1 | 1 | 2 | 2
3 | 1 | 1 | 2 | 1 | 3
4 | 1 | 1 | 2 | 2 | 1
5 | 1 | 2 | 1 | 1 | 3
6 | 1 | 2 | 1 | 2 | 2
7 | 1 | 2 | 2 | 1 | 3
8 | 1 | 2 | 2 | 2 | 1
9 | 2 | 1 | 1 | 1 | 3
10 | 2 | 1 | 1 | 2 | 2
11 | 2 | 1 | 2 | 1 | 3
12 | 2 | 1 | 2 | 2 | 1
13 | 2 | 2 | 1 | 1 | 3
14 | 2 | 2 | 1 | 2 | 2
15 | 2 | 2 | 2 | 1 | 3
16 | 2 | 2 | 2 | 2 | 3
17 | 3 | 1 | 1 | 1 | 3
18 | 3 | 1 | 1 | 2 | 3
19 | 3 | 1 | 2 | 1 | 3
20 | 3 | 1 | 2 | 2 | 1
21 | 3 | 2 | 1 | 1 | 3
22 | 3 | 2 | 1 | 2 | 2
23 | 3 | 2 | 2 | 1 | 3
24 | 3 | 2 | 2 | 2 | 3
APPENDIX B: One Rule (1Rule) Algorithm Flowchart
APPENDIX C:
The following files accompany this article:
1. A Visual Basic project
2. A Microsoft Access 2000 database
3. An HTML version of this article
4. A Microsoft Word 2000 version of this article
5. 3 files named Index, lenses.data and lenses.names
PLEASE CONTACT MSACCESSGURU.COM IF THE ACCOMPANYING FILES ARE MISSING.
Copyright ©2000-2002 by msaccessguru.com - All Rights Reserved
Email: public@msaccessguru.com